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We study the problem of estimating high dimensional regression models regularized by 
a structured-sparsity-inducing penalty that encodes prior structural information on either 
input or output sides. We consider two widely adopted types of such penalties as our 
motivating examples: 1) overlapping-group-lasso penalty, based on the ii/i2 mixed-norm 
penalty, and 2) graph-guided fusion penalty. For both types of penalties, due to their 
non-separability, developing an efficient optimization method has remained a challenging 
problem. In this paper, we propose a general optimization approach, called smoothing 
proximal gradient method, which can solve the structured sparse regression problems with 
a smooth convex loss and a wide spectrum of structured-sparsity-inducing penalties. Our 
approach combines the smoothing technique and the proximal gradient method. It achieves 
a convergence rate significantly faster than the standard first-order method, subgradient 
method, and is much more scalable than the most widely used interior-point method. The 
efficiency and scalability of our method are demonstrated on both simulated and real genetic 
datasets. 

Keywords: Smmoothing Proximal Gradient, Structured Sparsity, Overlapping Group 
Lasso, Graph-guided Fused Lasso 

1. Introduction 

The problem of high-dimensional sparse feature learning arises in many areas in science and 
engineering. In a typical setting, the input lies in a liigh-dimensional space, and one is inter- 
ested in selecting a small number of input variables that influence the output. A popular ap- 
proach to achieve this goal i s to jointly optim ize the fitness loss function with a non-smooth 
^i-norm penalty (e.g., lasso ( Tibshir anil . 119961 )) that shrinks the coefficients of the irrelevant 
input variables to zero. However, this approach is limited in that it treats each input as 
independent of each other and hence is incapable of capturing any structural information 
among input variables. Recently, various extensions of the lasso penalty have been intro- 
duced to take advantage of the prior knowle dge of the structure among inputs to encourage 
closely related inputs to be selected jointly (JYuan and Linl . l2006l : iTibshirani and Saundersl . 
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2OO5I : IJenatton et all l2009l ). Similar ideas have also been explored to leverage the output 
structures in multivariate regression, where one is interested in estimatin g multiple related 



funct ional mappings from a common input space to multiple outputs (jObozinski et al. 
2003 ) ■ In this case, the structure over the outputs is available as prior knowledge, and 



the closely related out puts according to thi s structure are encouraged to share a similar 
set of relevant inputs ( Kim and Xind . l20ld ). Despite these progress, developing an effi- 
cient optimization method for solving the convex optimization problems resulting from the 
structured- spar sity-inducing penalty functions has remained a challenge. In this paper, we 
focus on the problem of developing efficient optimization methods that can handle a broad 
set of structured-sparsity-inducing penalties with complex structures. 

When the structure to be imposed h as a relatively s imple form, such as non-overlapping 
groups over variables (e.g., group l asso lYuan and Lid (120061^ ') , or a linear-ordering (a.k.a., 
chain) of variables (e.g., fused lasso ( Tibshirani and Saundersl . 12005)), efficient optim ization 
methods have been developed. For example, under group lasso (JYuan and Linl . 120061 ). due to 
the separability among groups, a proximal operato)^ associated with the penalty can be com - 



puted in a closed-form; thus, a number of composite gradient methods (JBeck and Teboulld . 
2003 : iNesterovl . 120071 : iLiu et al.l . l2009l ) that leverage the proximal operator as a key step 
(so-called "proximal gradient method") can be directly applied. For fused lasso, although 
the penalty is not separable, a coordinate descei it algorithm was shown feasible by explicitly 



leveraging on the linear ordering of the inputs ( Friedman et al.l . 120071 ). 



In order to handle a more general class of structures such as tree or graph, various 
models that further extend group lasso and fused lasso have been proposed. While the 
standard group lasso assumes groups are separated, overlapping group lasso, which intro- 
duces overlaps among groups so that each input can belong t o multiple groups, allo ws us to 
incorporate more complex prior knowledge on the structure ( Jenatton et al.l . l2009l ) . As for 
fused lasso, graph-guided fused lasso extends the chain structure to a gen eral graph, where 
the fusion penalty is applied to each edge of the graph ( Kim et all . |2009J). However, due to 
the non-separability of the penalty that arises from overlapping groups or graphs, the fast 
optimization method for the standard group lasso or fused lasso cannot be easily applied 
(e.g. no closed- form solution of the proximal operator). In principle, generic solvers such 
as the interior-point methods (IPM) could always be used to solve either a second-order 
cone programming (SOCP) or a quadratic programming (QP) formulation of the afore- 
mentioned problems, such approaches are computationally prohibitive even for problems of 
moderate size. Very recently, this problem has received excellent attentions from a num- 
ber of papers ( Jenatton et al.l . l20ld : iMairal et al.l . l20ld : IPuchi and Singer! . l2009l : iLiu et al. 



20ld : iHoeflind . l2009l ) which all strived to provide clever solutions to various subclasses of 



the structured-sparsity-inducing penalties; but as we survey in Section 2, they are still shy 
of reaching a simple, unified, and general solution to a broad class of structured sparse 
regression problems. 

In this paper, we propose a generic optimization approach, the smoothing proximal 
gradient method, for dealing with a variety of structured-sparsity-inducing penalties, using 
overlapping-group-lasso penalty and graph-guided fusion penalty as motivating examples. 
Although these two types of penalties are seemingly very different, we show that it is 



1. The proximal operator associated with the penalty is defined as arginino ^\\/3 — v||2 + P{f3), where v is 
any given vector and P(/3) is the non-smooth penalty. 
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possible to decouple the non-separable terms in both penalties via the dual norm; and re- 
formulate them into a common form to which the proposed method can be applied. We 
call this approach "smoothing" proximal gradient method because instead of optimizing the 
original problem directly as in other proximal gradient methods, we introduce a smooth ap- 
proximation of t he structured-sparsity- inducing penalty using the smoothing technique from 
( Nesterovl . l2005l ) . Then, we solve the approximation problem by the first -order proximal gra- 



dient method: fast iterative shrinkage-thresholding algorithm (FISTA) (lBeck and Teboulle 



2OO9I ) ■ It achieves 0(|) convergence rate for a desired accuracy e. Below, we summarize 



the main advantages of this approach. 

(a) It is a first-order method, as it uses only the gradient information. Thus, it is signif- 
icantly more scalable than IPM for SOCP or QP. Since it is gradient-based, it allows 
warm restarts, thereby pote ntiates solving the problem along the entire regularization 



path (JFriedman et al.l . 120071 ) . 



(b) It is applicable to a wide class of optimization problems with a smooth convex loss 
and a non-smooth non-separable structured sparsity-inducing penalty. The mixed- 
norm penalties as in overlapping-group-lasso and graph-guided fusion penalty are only 
examples for demonstration. 

(c) Theoretically, it enjoys a convergence rate of O(-), which dominates that of standard 
first-order method and subgradient method with rate O(^). 

(d) It is applicable to both uni- and multi-variate regression, with structures on either (or 
both) inputs/outputs. 

(e) Easy to implement with a few lines of MATLAB code. 

The rest of this paper is organized as follows. In Section 2, we present the of overlapping 
group lasso and graph-guided fused lasso and discuss the existing optimization methods in 
the literature. In Section 3, we present the smoothing proximal gradient method along 
with complexity results. In Section 4, we present the generalization of this algorithm to the 
setting of the multivariate regression. In Section 5, we present numerical results on both 
simulated and real datasets, followed by conclusions in Section 6. Throughout the paper, we 
will discuss overlapping-group-lasso and graph-guided fusion penalties in parallel to illus- 
trate how the smoothing proximal gradient method can be used to solve the corresponding 
optimization problems. 

2. Background: Linear Regression Regularized by 

Structured-sparsity-inducing Penalties and Related Optimization 
Methods 

We begin with a brief review of the high-dimensional linear regression model regularized 
by structured-sparsity-inducing penalties. To better position the proposed method, we also 
provide a survey of latest developments of efficient algorithms for solving subclasses of 
structured sparse regression problems. 

Let X G M^^"^ denote the input data for N samples, where each sample lies in J 
dimensional space, and y G M^^^ be the output data. We assume a linear regression 
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Table 1: Comparisons of different first-order methods for optimizing mixed- norm based 
overlapping-group-lasso penalties. 



Method 


No overlap 

111 (.2 


No overlap 

ill toe 


Overlap 
Tree li/i2 


Overlap 
Tree h/iao 


Overlap 
Arbitrary 

h/1^2 


Overlap 
Arbitrary (.i/ioo 


Projection llLiu et all 


0{J) 


0(J log J) 


N.A. 


N.A. 


N.A. 


N.A. 


l2009|) 


Coordinate Ascent 
dJenatton et al.l, I2OIOI; 


o{J) 


0(J log J) 




0{j:,^g\g\log\g\) 


N.A. 


N.A. 


Liu and Yd. l2010bl) 


Network Flow 


N.A. 


quadratic 

min-cost 

flow 


N.A. 


( -T= ) , quadratic 
min-cost flow 


N.A. 


0(-T=), quadratic 
min-cost flow 


((Mairal et al.l. l2010t) 


FOBOS 


0{^),0(J) 


0(7). 
0(J log J) 


0(7). 
0(T.,^,\g\) 


0{j:,^g\g\log\g\) 


0(E,ggl9l) 
(subgradient) 


O(-i), quadratic 
min-cost flow 


dDuchi and Singerl. 


I2OO9I) 


Smoothing Prox-Grad 


0{\),0{J) 


o{\), 

0{J log J) 


0(7). 
0(E<,ecM) 


0(i), 
OCEaealslloglsl) 


o{i), 

0(E„eal9l) 


o(i), 

0(E„gql9|logl9l) 



model, y = X/3 -|- e, where /3 is the vector of length J for regression coefficients and e is the 
vecto r of length N for noise distributed as A^(0, ct'^Inxn)- The standard lasso (JTibshiranil . 
19961 ) obtains a sparse estimate of the coefficients by solving the following optimization 
problem: 

minfl(/3) + A||/3||i, (1) 



where g{l3) 



y — X/3II2 is the squared-error loss, ||/3||i = S,=il/3j| is the .^i-norm 



penalty that encourages the solutions to be sparse, and A is the regularization parameter 
that controls the sparsity level. 

While the standard lasso penalty does not assume any structure among the input vari- 
ables, recently various extensions of the lasso penalty have been proposed that incorporate 
the prior knowledge on the structure over the input variables to estimate a joint sparsity 
pattern among related inputs. We broadly call the structured-sparsity-inducing penalty 
r2(/3) without assuming a specific form, and define the problem of estimating a structured- 
sparsity pattern in the coefficients as follows: 



mmf{P)^g{P) + n{P) + X\\(3\\ 



1- 



(2) 



As examples of such penalties, in this paper, we consider two broad categories of penal- 
ties ri(/3) based on two different types of functional forms, namely overlapping-group-lasso 
penalty based on the ^1/^2 mixed-norm and graph-guided fusion penalty. As we discuss 
below, these two types of penalties cover a broa d set of structured-sparsity-inducing penal- 
ties that have been introduced in the li t erature ( Yuan and Lin ^ 2006 ;_ Jenatton et al. ^ 20091 : 
Kim and Xinel . I2OIOI : Izhao et al.l . I2OO9I : iTibshirani and Saundersl . l200,4 iKim et all hOQ^ 



1. Overlapping-group-lasso Penalty 

Assuming that the set of groups of inputs G = {gi, . . . , ^igi} is defined as a subset of 
the power set of {1, . . . , J}, and is available as prior knowledge. Note that members 
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of Q (groups) are allo wed to overlap. The o verlapping-group-lasso penalty based on 
the £1/^2 mixed- norm (jJenatton et alj . l2009l ) is defined as: 



niP) 






(3) 



where (3 € M'^' is the subvector of /3 for the inputs in group g; Wg is the predefined 
weight for group g; and || • II2 is the vector ^2-iiorm. We assume that each group has 
at least two variables since otherwise the penalty for that group can be incorporated 
into the £i-norm in ([2]). The £1/^2 mixed- norm penalty r2(/3) plays the role of setting 
all of the coefficients within each group to zero or non-zero values. It is worthy to 
note that the ii/ioo mixed-norm penalty can also achieve the similar grouping effect. 
Although our approach can also be used for the ii/ioo as well, in this paper, we focus 
on the ii/i2 penalty and the comparison between the £1/^2 and the ii/ioo is beyond 
the scope of the paper. 

Most of the existing optimization methods developed for mixed-norm penalties can 
handle only a specific subclass of the general overlapping- g roup- lasso penalties . Most 
of these metho ds use the proximal gradient framework ( Beck and Teboulld . l2009l : 



Nesterovl . 120071 ) and focus on the issue of how to exactly solve the proximal operator. 
For non-overlapping groups with the ^1/^2 or i^ jino mixed- r iorms, the proximal oper - 



ator can be solved via a simple projection (JLiu et al.l . l2009l : IPuchi and Singed . l2009l ) . 



A one-pass coordinate asc ent method has been developed fo r tree-st ructured groups 
with the £1/^2 or ixjioo (JJenatton et al.l . l20ld : iLiu and Yd . l2010bl ). and quadratic 
min- cost network flow for arbitrary overlapping groups with the ixjioo (jMairal et al. 
20inl V 



Table [J summarizes the applicability, the convergence rate, and the per- iteration time 
complexity for the available first-order methods for different subclasses of group lasso 
penalties. More specifically, the first three rows adopts the proximal gradient frame- 
work and the first column of these rows gives the algorithms for solving the proximal 
operator and the corresponding references. Each entry contains the convergence rate 
and the per-iteration time complexity. For the sake of simplicity, in all of the entries, 
we omit the time for computing the gradient of the loss function which is needed 
for all the methods (i.e., \/g{(3) with 0{J'^)). The per-iteration time complexity in 
the table may come from the computation of proximal operator or subgradient of the 
penalty. "N.A." standards for "not applicable" or no guarantee in the convergence. 
For comparison, we show the same information of the proposed method in the last 
row of Table [TJ Although our method is not the most ideal one fo r some of the special 
cases, our method along with FOBOqj (JDuchi and Singerl . l2009l ) are the only generic 
methods that can be applied to all subclasses of the penalties. 



2. We note that in contrast to other proximal gradient methods which require the loss function (i.e. first 
part of the objective) to be smooth, FOBOS is a more general framework in that it allows the non- 
smoothness of the loss function. It performs the subgradient descent on the loss and solves the proximal 
operator associated with the penalty. If the loss function is smooth and the proximal operator can be 
solved exactly, it achieves O(-) convergence rate. Otherwise, if the proximal operator cannot be exactly 
solved, FOBOS reduces to the subgradient descent on the sum of the loss and the penalty with 0(-^) 
convergence rate. 
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As we can see from Table[Tl for ar bitrary overlaps with the ii /ioo , although the method 
proposed in ( Mairal et all I2OI0I ) achieves 0(4j) convergence rate, the per-iteration 
complexity could be high due to solving a quadratic min-cost network flow problem. 
From the worst-c ase analysis, the per- iteration time complexity for solving the network 
flow problem in (JMairal et alJ . l20ld ^ is at least 0(|F||^|) = 0{{J + \g\){\g\ + J + 
Sgeelsl))' which is much higher than our method with 0{'^^g \g\ log \g\). More 
importantly, for the case of arbitrary overlaps with the ii/i2, our method has a 
superior convergence rate to all the other methods. 

In addition to these methods, an active-set algorithm was pr oposed that applies t o 



the square of the ^1/^2 mixed- norm with overlapping groups ( Jenatton et al.l . |2009| ). 



This method formulates each subproblem involving only the active variables either 
as an SOCP, which can be computationally expensive for a large active set, or as a 
jointly convex problem with auxiliary variables, which is then solved by an alternating 
gradient descent. The latter approach involves an expensive matr ix inversion at each 
iteration and lacks the global convergence rate. Another method (JLiu and Yd . l2010al ) 
was proposed for overlapping group lasso which approximately solves the proximal 
operator. However, the convergence of such type of approach cannot be guaranteed, 
since the error introduced in ea ch proximal o perat or will be accumulated over itera- 
tions. A primal-dual algorithm (IMosci et al.l. 120101) was p roposed to solve a different 



formulation of group lasso problem in (I Jacob et al.l . l2009l ) 



2. Graph-guided Fusion Penalty 

Let us assume the structure of J input variables is available as a graph G with a set 
of nodes V = {1, . . . , J} and a set of edges E. Let Vmi G K denote the weight of 
the edge e = (m, /) G E, corresponding to the correlation between the two inputs for 
nodes m and /. Then, the graph-guided fusion penalty as defined below genera lizes 
chain-structured fused-lasso penalty proposed by iTibshirani and SaundersI (120051 ): 



e=(m,l)£E,m<l 



signer ml) Pi 



(4) 



where T{r) weights the fusion penalty for each edge e = {m, I) such that /3m and /3i for 
highly correlated inputs with larger \rmi\ receive a greater fusion effect. In this paper, 
we consider T{r) = \r\, but any monotonically increasing function of the absolute 
values of correlations can be used. The sign{rmi) indicates that for two positively 
correlated nodes, the corresponding coefficients tend to be influence the output in the 
same direction, while for two negatively correlated nodes, the effects {(3m and (3i) take 
the opposite direction. Since this fusion effect is calibrated by the edge weight, the 
graph-guided fusion penalty in (j4j) encourages highly correlated inputs corresponding 
to a densely connected subnetwork in G to be jointly selected as relevant. We notice 
that if rmi = 1 for all e = (m, I), the penalty function in (j4]) reduces to: 



^(/3)=7 E 1/3- -Al 



(5) 



e={m,l)£E ,m<l 



The standard fused lasso penalty 7X]j=i l/^j-i- i — 0j\ is special case of (ISl). w here 
the graph structure is confined to be a chain ( Tibshirani and SaundersI |2005| ) and 
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the widely used fused signal approximator refers to the simple case where the design 
matrix X is orthogonal. 

For the graph-guided fusion penalty, when the str ucture is a simple chain, pathwise 



coordinate descent method (JFriedman et alj . 120071 ) can be applied. For the general 



graph structur e, a first-o r der rn ethod that approximately solves the proximal operator 
was proposed ( Liu et al.l . I2OI0I ). However, the convergence cannot be guaranteed due 



to the errors introduced in computing the proximal operator over iterations. Recently, 
a different alg orithm has been proposed that reformulates the problem as a maximum 
flow problem ( Hoefiingl . 12003 ). This algorithm is limited in that it can not be applied 



to the case where the dimension is more than the sample size which is the typical 
setting for spa rse regression problern s . An other work for solving the graph-guided 
fused lasso in ( Tibshirani and Taylorj . l20ld ) also suffers from the same problem. In 



addition, all these work lack the theoretical analysis of the convergence. 

3. Smoothing Proximal Gradient 

In this section, we present the smoothing proximal gradient method. The main difficulty in 
optimizing ([2]) arises from the non-separability of (3 in the non-smooth penalty ^{(3). For 
both types of penalties, we show that using the dual norm, the non-separable structured- 
sparsity-inducing penalties ^{(3) can be formulated as Q{(3) = maXaeQCt^CP- Based on 
that, we introduce a smooth approximation to r2(/3) such that its gradient can be easily 
calculated. 

3.1 Reformulation of the Structured-sparsity-inducing Penalty 

In this section, we show that despite the non-separability, by using the dual norm, both 
types of the structured-sparsity-inducing penalties in ([3]) and ^ can be decoupled once 
we reformulate them into the common form of a maximization problem over the auxiliary 
variables. 

1. Overlapping-group-lasso Penalty 

Since the dual norm of ^2-iiorm is also an ^2-iiorm, we can write ||/3„||2 as ||/3„||2 = 
maxiiQ ||2<i ctqPn, where ctg G W^' is the vector of auxiliary variables associated with 

. Then, ct is a vector of length X^gg^ \g\ with domain 

Q = {a \ ||Q;g||2 < 1, V5 G G}, where Q is the Cartesian product of unit balls in 
Euclidean space and thus, a closed and convex set. We can rewrite the overlapping- 
group-lasso penalty in ([3]) as: 

0(/3)=7> Wg max a /3„ = max> jWga f3„ = max. cx.^Cf3, (6) 

where C G R^sesl^l^ is a matrix defined as follows. The rows of C are indexed 
by all pairs of {i,g) G {{i,g)\i £ g,i £ {!,..., J}}, the columns are indexed by 
j G {1, . . . , J}, and each element of C is given as: 

(*'^)'J \ otherwise. ^^ 

7 



T 

/3„. Let a 



9' 



T T 

"51' • • • ' 9|ei 
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Note that C is a highly sparse matrix with only a single non-zero element in each row 
and X^aeC 1^1 non-zero elements in the entire matrix, and hence, can be stored with 
only a small amount of memory during the optimization procedure. 

2. Graph-guided Fusion Penalty 

First, we rewrite the graph-guided fusion penalty in (j4|) as follows: 

7 X] T{rmi)\Pm - sign{rrni)l3i\ = \\CP\\i, 

e={m,l)£E,m<l 



where C G MI^I^'' is the edge- vertex incident matrix: 



a 



e=(m,l),j 



1-T{rmi) if j = m 

-7 • sign{rmi)T{rmi) if i = / (8) 

otherwise. 



Again, we note that C is a highly sparse matrix with 2 • |ii^| non-zero elements. 

Since the dual norm of the £oo-norm is the £i-norm, we can further rewrite the graph- 
guided fusion penalty as: 

\\CP\\i= max a'^C(3, (9) 

\\a\\oo<l 

where a G Q = {q;|||q;||oo < 1,Q: G M' '} is a vector of auxiliary variables associated 
with ||C/3||i, and || • ||oo is the ^oo-norm defined as the maximum absolute value of all 
entries in the vector. 

Remark 1 As a generalization of graph-guided fusion penalty, the proposed opti- 
mization method can be applied to the ii-norm of any linear mapping of j3 (i.e., 
n{f3) = \\Cf3\\i for any given C). 

3.2 Smooth Approximation of Structured-sparsity-inducing Penalty 

The common formulation of ^{(3) (i.e., ^{(3) = max^ggo; C/3) is still a non-smooth func- 
tion of f3, and this makes the optirn i zation challenging. To tackle this problem, using the 



smoothing technique from (JNesterovl . l2005l ). we construct a smooth approximation of i^{f3) 

as following: 

L(/3) = max (a'^Cl3 - fid(a)) , (10) 

cteQ 

where /x is the positive smoothness parameter and d{Q.) is defined as 2l|o;|l2- The original 
penalty term can be viewed as f^{f3) with /i = 0. It is easy to see that f^{P) is a lower 
bound of /o(/3)- In order to bound the gap between /^(/3) and /o(/3), let D = maxcgg d{a). 
In our problems, D = \G\/2 for the overlapping-group-lasso penalty and D = \E\/2 for the 
graph-guided fusion penalty. Then, it is easy to verify that the maximum gap between 
/^(/3) and /o(/3) is fiD: 

/o(/3)-AiZ)<^(/3)</o(/3). 

From Theorem 1 as presented below, we know that f^{l3) is a smooth function for any // > 0. 
Therefore, f/j^iP) can be viewed as a smooth approximation of /o(/3) with the maximum gap 
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(b) 

Figure 1: A geometric illustration of the smoothness of //i(/3)- (a) The 3-D plot of z[a, /?), (b) the 
projection of (a) onto the /3-z space, (c) the 3-D plot of Zs(a,/3), and (d) the projection 
of (c) onto the P-z space. 



of /iZ), and the /i controls the gap between f^{P) and fo{(3). Given the desired accuracy e, 
the convergence result in Section 13.41 s uggests u = jp to achieve the best convergence rate. 



2OO5I ) to show that ffj,{(3) is smooth in /3 



Now we present the key theorem (JNesterov 
with a simple form of the gradient. 



Theorem 2 For any fx > 0, ffi{(3) is a convex and continuously- dijferentiable function in 
(3, and the gradient of f^{(3) takes the following form: 



Vf^i(3) = C^a*, 



(11) 



where ex* is the optimal solution to (jlOp . Moreover, the gradient Vf^{f3) is Lipschitz con- 
tinuous with the Lipschitz constant L^ = -||C|p, where \\C\\ is a special norm of C defined 



as \\C\\ = max||v||2<i ||C'v||2. 



This theorem is given in (JNesterovl . l2005l ) but without a detailed proof of smoothness 
and a derivation of the gradient. Just for the purpose of completeness, we show that by 
viewing f^{l3) as the Fenc hel Conjuga t e of d {-) at — ^, the smoothness can obtained by 
applying Theorem 26.3 in ( Rockafellai . 119961 ) . The gradient in dill) can be derived from 



the D anskin's Theorem ( Bertsekad . Il999l ) and the Lipschitz constant is shown in ( Nesterovl . 



2OO5I ). The details are discussed in the appendix. 



To provide insights on why /^(/3) is a smooth function as Theorem 1 suggests, in Figure 
[H we show a geometric illustration for the case of one-dimensional parameter (i.e., /3 G M). 
For the sake of simplicity, we assume that /i and C are set to 1. First, we show geometrically 
that /o(/3) = max^gr^ i] z(a,/3), where z{a,(3) = a/3, is a non-smooth function. The three- 
dimensional plot for z{a,(3) with a restricted to [—1, 1] is shown in Figure [Dj^a) . We project 
the surface in Figure[T]^a) onto the (3 — z space as shown in Figure [Dj^b). For each /3, the value 
of /o(/S) is the highest point along the z-axis since we maximize over a in [—1, 1]. We can 
see that /o(/3) is composed of two segments with a sharp point at /3 = 0. Now, we introduce 
the auxiliary function, and let Zs(a,/3) = a/3 — ^a^ and /^(/3) = max„g[_i 1] Zs{a,(3). The 
three-dimensional plot for Zs{a,l3) with a restricted to [—1,1] is shown in Figure [T][c). 
Similarly, we project the surface in Figure \V[c) onto the (3 — Zs space as shown in Figure 
[TJ^d). For fixed /3, the value of /^(/3) is the highest point along the z-axis. In Figure [I][|d) , 
we can see that f^{fi) is composed of three parts: (i) a line with slope —1 when /3 < 1, 
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(ii) a line with slope 1 when /3 > 1, and (iii) a quadratic function when — 1 < /3 < 1. By 
introducing an auxiliary quadratic function, we remove the sharp point at /3 = and /^(/3) 
becomes a smooth function. 

To compute the V/^(/3) and L^, we need to know a* and ||C||. We present the closed- 
form equations for a* and ||C|| for the overlapping-group-lasso penalty and graph-guided 
fusion penalty in the following propositions and the proof is presented in the appendix. 

1. Overlapping-group-lasso Penalty 

Proposition 3 Let a* , which is composed of {a*}g£g, be the optimal solution to (jlOp 
for overlapping-group-lasso penalty in ^. For any g ^Q, 

where S is the projection operator which projects any vector u to the £2 ball: 

ulb > 1, 



5(u) = <^ ll"H2 

I 1111^1 

U U 2 < 1- 



|C||=7max,.,|i_^} /V ^_^^^.^ (u;,)2. 



^geg s.t. jeg 

2. Graph-guided Fusion Penalty 

Proposition 4 Let a* be the optimal solution of (llOp for graph-guided fusion penalty 
in dH. Then, we have: 

a = S{ ), 

where S is the shrinkage operator defined as follows: 

X, if — 1 < a; < 1 
S{x) = {1, if X > 1 

-1, if X < -1. 

For any vector ex, S{a) is defined as applying S on each and every entry of ex. 



\\C\\ is upper-bounded by yjl'i^ maxjgy dj, where 

d,= J2 (^(^^))' (12) 

e&E s.t. e incident on j 

for j £ V in graph G, and this bound is tight. Note that when T{re) = 1 for all e £ E, 
dj is simply the degree of the node j . 
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3.3 Smoothing Proximal Gradient Descent 

Given the smooth approximation of the non-smooth structured-sparsity-inducing penalties 
as presented in the p revious section, now, we a pply the fast iterative shrinkage-thresholding 
algorithm (FISTA) ( Beck and Teboulld . l2009l ). using the gradient information in Theorem 



[2j We substitute the penalty term 0,{f3) in ([2]) with its smooth approximation ffi{(3) to 
obtain the following optimization problem: 



Let 



mm/(/3) = g{f3) + f^{(3) + A||/3||i. (13) 

h{f3) = gi(3) + f^iP) = l\\y- X/3||2 + f^{(3). 
According to Theorem O the gradient of /i(/3) is given as: 

V/i(/3) = X'^(X/3-y) + C7^Q*. (14) 

Moreover, Vh{f3) is Lipschitz-continuous with the Lipschitz constant: 

L = A„,ax(X^X) + L^ = A„,ax(X^X) + i^, (15) 

where AmiK(X^X) is the largest eigenvalue of (X^X). 

Since f{(3) only involves a very sirnple no n-smooth part (i.e., the £i-norm penalty), we 
can adopt FISTA ( Beck and Teboulld . l2009l ) to minimize /(/3) as shown in Algorithm [TJ 



Algorithm [T] alternates between the sequences {w^} and {/3*} and 9t = -j^ can be viewed as 
a special "step-size", which det ermines the relationship b etween {w^} and {/3*} as in Step 
4 of Algorithm [TJ As shown in ( Beck and Teboulld . |2009J), such a way of setting 9t leads to 



Lemma 1 in Appendix, which further guarantees the convergence result in Theorem [71 

Rewriting ([TUj). we can easily see that it is the proximal operator associated with the 
£i-norm : 

/3*+^ = argminl||/3 - (w* - y V/i(w*))||2 + h(3\\,. 

Let V = (w* — j-V/t(w*)), the clo sed-form solution for /3*^ can be obtained by soft- 
thresholding ( Friedman et al.l . 120071 ) as presented in the next proposition. 



Proposition 5 The closed-form solution of 

mmi||/3-v||2 + Ap||^ 

can be obtained by the soft-thresholding operation: 

Pj = sign(t;j) max(0, \vj\- -), j = 1, . . . , J. (17) 

A notable advantage of utilizing the proximal operator associated with the £i-norm 
penalty as in Step 2 of Algorithm [T] is that it can provide us with the sparse solutions, 
where the coefficients for irrelevant inputs are set exactly to zero, due to the soft-thresholding 
operation in (J17p . At convergence, the values of (3 for the irrelevant groups in the group- 
lasso penalty and edges in the graph-guided fusion penalty will become sufficiently close to 
zeros, and the soft-thresholding operation will truncate them exactly to zeros. 

11 



Chen and Lin and Kim and Carbonell and Xing 



Algorithm 1 Smoothing Proximal Gradient Method for Structured Sparse Regression 

Input: X, y, C, (3^ , desired accuracy e. 
Initialization: set fi = -^j Oq = 1, w*^ = /3 . 
Iterate For t = 0, 1, 2, . . ., until convergence of /3*: 

1. Compute V/i(w*) according to (|14p . 

2. Solve the proximal operator associated with the ^i-norm: 

/3*+i = argmin/i(w*) + (/3 - w*, V/i(w*)) + X\\(3\\i + -||/3 - w*||2 (i6) 

13 2 



3. Set 6*4+1 — 44:3- 



4. Set w*+i = /3*+i + i^0i+i(/3*+^ - /3*^ 
Output: 3 = /3* 



ft 

3t + l 



Remark 6 Algorithm[I\is a general approach that can be applied to any smooth convex loss 
(e.g. logistic loss), with the structured- sparsity-inducing penalty i^{f3) that can be re-written 
in the form of maxc a^Cf3. 

3.4 Convergence Rate and Time Complexity 

Although we optimize the approximation function /(/3) rather than optimizing f{(3) di- 
rectly, it can be proven that the /3 obtained from Algorithm [1] is sufficiently close to the 
optimal solution (3* to the original optimization problem in ([2]) . We present the convergence 
rate of Algorithm [1] in the next theorem. 

Theorem 7 Let (3* be the optimal solution to ^ and /3* be the approximate solution at 
the t-th iteration in Algorithmic If we require /(/3*) — /(/3*) < e and set n = ^, then, the 
number of iterations t is upper-bounded by 




/'°lll^_,x.x, + ?3W, „„ 



A bound similar to (jlSp has been shown in ( Lan et al.l . l201ll ) where the smoothing technique 



is combined with a different first-order algorithm instead of FISTA. The key idea behind 
the proof of this theorem is to decompose /(/3*) — f{(3*) into three parts: (i) /(/3*) — /(/3*), 
(ii) 7(/3*) - 7(/3*), and (ih) 7(/3*) - /(/3*). (i) andXm) can be bounded by the gap of the 
approximatio n fiD. fii) only involves the function / and can be upper bounded by 0{-p) 
as shown in ( Beck and Teboulld . l2009l ) by balancing these three terms. The details of the 



proof are presented in Appendix. We obtain (jlSp . According to Theorem [71 Algorithm [T] 
converges in 0{ ^ ) iterations, which is much faster than the subgradient method with the 
convergence rate of O(^). Note that the convergence rate depends on D through the term 
V2D, and the D depends on the problem size. 
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Table 2: Comparison of Per- iteration Time Complexity 





Overlapping Group Lasso 


Graph-guided Fused Lasso 


Smoothing Proximal Gradient 


0{J^ + T.,eg\g\) 


0{J^ + \E\) 


IPM for SOCP 


0{{J+\Q\?{N + Y.geg\9\)) 


0{{J + \E\)\N + J+\E\)) 



As for the time complexity, assuming that we can pre-compute and store X^X and 
X y with the time complexity of 0{J'^N), the main computational cost in each iteration 
comes from calculating the gradient V/i(w(). As we discussed in the Section^ the existing 
methods for solv ing Q with the i ^ /i2 m ixed-norm based overlapping-group-lasso penalty 
include FOBOS (JDuchi and Singerl . bnOfll ) (or subgradient descent) and IPM for SOCP. The 
first approach has the same per-iteration time complexity as our method but with 0(t) 
convergence rate. For IPM for SOCP, although it converges in fewer iterations (i.e., log(-)), 
its per-iteration complexity is higher by orders of magnitude than ours as shown in Table [21 
Therefore, the smoothing proximal gradient method is much more efficient and scalable for 
large-scale problems. In addition to time complexity, each IPM iteration of SOCP requires 
significantly more memory to store the Newton linear system. 



Remark 8 // we can pre-compute and store XX, the per-iteration time complexity of our 
method is independent of sample size N as shown in TablelM If J is very large, XX may 
not fit into memory. In such a case, we compute X-^(Xw*) for each iteration; and the 
per-iteration time complexity will increase by a factor of N but still less than that for IPM 
for SOCP. For the logistic loss, the per-iteration complexity is also linear in N. 



3.5 Summary and Discussions 

The insight of our work was drawn from two lines of earlier works. Th e first one is th e 
proxima l gradient methods (e.g., Nesterov's composite gradient method (|Nesterovl . 120071 ). 



FISTA ( Beck and Teboulld . l2009l )). They have been widely adopted to solve optimization 



problems with a convex loss and a relatively simple non-smooth penalty, and achieve 0{-^) 
convergence rate. However, the complex structure of the non-separable penalties considered 
in this paper makes it intractable to solve the proximal operator exactly. This is the 
challenge that we circumvent via smoothing. 



The idea of smoothing th e non-smooth fu nction was first discussed in (|Nesterovl . |2005| ) . 
The algorithm presented in (JNesterovl . l2005l ) only works for smooth problems so that it 
has to smooth out the entire non-smooth penalty (including the ^i-norm). However, it is 
precisely the non- smoothness of the penalty that leads to exact zeros in optimal solutions. 
Therefore, although widely adopted in optimization field, this approach cannot yield exact 
zero solution and leads the probl em of where to truncate the solution to zero. Moreover, 
the algorithm in (Nesterovj, |2005| ) requires the condition that /3 is bounded and that the 
number of iterations is pre-defined, which are impractical for real applications. Instead, our 
approach combines the smoothing idea with the proximal gradient method and hence leads 
to the exact sparse solutions. 
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Inputs _ _ 

(SNPs) ACG TTTAC GTA AA TTAC 

Outputs 
(phenotypes) 




Figure 2: Illustration of the multivariate regression with graph structure on outputs. 



As for the convergence rate, the gap between 0(7) and the optimal rate 0(4=) is due 
to the approximation of the structured-sparsity-inducing penalty. It is possible to show that 
if XX is a positive definite (PD) matrix, 0{-i=) can be achieved by a variant of excessive 
gap method ( Nesterovl . l2003l ) . However, such a rate can not be easily obtained for sparse 
regression problems where J > N (i.e., XX is not PD). For some special cases as discussed 
in Section [21 such as tree-structured or the £i/£oo mixed-norm based overlapping groups, 
0{-^) can be achieved at the expense of more computation time for solving the proximal 
operator. It remains an open question whether we can further boost the generally-applicable 
smoothing proximal gradient method to achieve 0{—^). 



4. Extensions for Multivariate Regression 

The structured-sparsity-inducing penalties as disc ussed in the pre v ious section can be siin - 
ilarly used in the multivariate regression setting ( Kim and Xingi . 120101 : iKim et al.l . 12003 ) 
where the prior structural information is available for the outputs. For example, in genetic 
association analysis, where the goal is to discover few genetic variants or single neucleotide 
polymorphisms (SNPs) out of millions of SNPs (inputs) that influence phenotypes (out- 
puts) such as gene expression measurements, the correlation structure of phenotypes can 
be naturally represented as a graph, which can be used to guide the selection of SNPs as 
shown in Figure [2j Then, the graph-guided fusion penalty can be used to identify SNPs 
that are relevant jointly to multiple related phenotypes. 

While in the multivariate regression problems, we encounter the same difficulties of 
optimizing with non-smooth and non-separable penalties as in the previous section, the 
smoothing proximal gradient method can be extended to this problem in a straightforward 
manner. Due to the importance of applications, in this section, we briefly discuss how 
our method can be applied to the multivariate regression with structured-sparsity-inducing 
penalties. 



4.1 Multivariate Linear Regression Regularized by 
Structured-sparsity-inducing Penalties 

Let X G M^^"^ denote the matrix of input data for J inputs and Y G M^^"- denote the 
matrix of output data for K outputs over A^ samples. We assume a linear regression model 
for each of the k-th output: yk = X/3;., + e^, \/k = 1, . . . K, where (3f. = [/3ifc, • • . , Pjk]'^ 
is the regression coefficient vector for the A;-th output and e^ is Gaussian noise. Let B = 



[Pi,...,Pk]^ 



IxK 



be the matrix of regression coefficients for all of the K outputs. Then, 
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the multivariate structured sparse regression problem can be naturally formulated as the 
following optimization problem: 

min /(B) = -||Y-XB|||, + 0(B) + A||B||i, (19) 

where || • \\f denotes the matrix Probenius norm, || • ||i denotes the matrix entry-wise ii 
norm, and il(B) is a structured-sparsity-inducing penalty with the structure over outputs. 

1. Overlapping-group- lasso Penalty in Multivariate Regression: We define the 
overlapping-group-lasso penalty for the in multivariate regression as follows: 

J 

f](B) = 7^j;^,||/3^.J|2, (20) 

i=i see 

where Q = {gi, . . . ,g\g\} is a subset of the power set of {1, . . . ,K} and f3jg is the 
vector of regression coe fficients {0jk,k G q}. Both £i/i2 mixed-norm penalty for 
multivariate regressio n (jObozinski et all l2009l ) and tree-guided group-lasso penalty 



( Kim and XineJ . 120101 ) are special cases of (pOj) . 



2. Graph-guided Fusion Penalty in Multivariate Regression:Assuming the graph 

structure over the K outputs is given as G with a set of nodes V = { 1 , . . . , K} and 

a set of edges E, the graph-guided fusion penalty for multivariate regression is given 

as: 

J 

f](B)=7 Yl r{rm.l)J2\fijm-sign{rmi)l3ji\. (21) 

e={m,l)eE j=l 

4.2 Smoothing Proximal Gradient Descent 

Using the similar techniques in Section [3. H 0,(B) can be reformulated as: 

n(B) = max(CB'^, A), (22) 

AeQ 

where (U, V) = Tr(U-^V) denotes a matrix inner product. C is constructed in the similar 
way as in ([7]) or ([8]) just by replacing the index of the input variables with the output 
variables, and A is the matrix of auxiliary variables. 

Then we introduce the smooth approximation of l\22\} : 

/^(B) = max {{CB^, A) - /xd(A)) , (23) 

where d{A) = 2||A|||,. Following a proof strategy similar to that in Theorem [21 we can 
show that /^t(B) is convex and smooth with gradient V/^(B) = (A*)-^C, where A* is the 
optimal solution to (I23p . The closed- form solution of A* and the Lipschitz constant for 
V/^(B) can be derived in the same way. 

By substituting 0(B) in (fT9]) with f^(B), we can adopt Algorithm [1] to solve (fT9|) with 
convergence rate of O(-). The per-iteration time complexity of our method is 0{J^K + 
JJ2q£g bl) f°^ overlapping group lasso and 0{J^K + J\E\) for graph-guided fused lasso. 
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5. Experiment 

In this section, we evaluate the scalabiUty and efficiency of the smoothing proximal gradient 
method (Prox-Grad) and apply Prox-Grad on overlapping group lasso to a real genetic 
dataset. 



For overlapping group lasso, we compare the Prox-Grad with the FOBOS (jDuchi and Singer . 



2OO9I ) and IPM for SOCPo For grap h-guided fused las s o, we compare the running time of 



Prox-Grad with that of the FOBOS (IPuchi and Singerl . I2OO9I ) and IPM for QpEI Note that 



for FOBOS, we set the "loss function" to be l{l3) = g{f3) + n{P) and the penalty to A||/3||i. 
As discussed in Section [2l for the non-smooth 1{I3), FOBOS achieves O (^) convergence 
rate, which is slower than Prox-Grad. 

All experiments are performed on a standard PC with 4GB RAM and the software is 
written in MATLAB. The main difficulty in comparisons is a fair stopping criterion. Unlike 
IPM, Prox-Grad and FOBOS do not generate a dual solution, and therefore, it is not possible 
to compute a primal-dual gap, which is the traditional stopping criterion for IPM. Here, we 
adopt a widely used approach for comparing different methods in optimization literature. 
Since it is well known that IPM usually gives more accurate (i.e., lower) objective, we set the 
objective obtained from IPM as the optimal objective value and stop the first-order methods 
when the objective is below 1.001 times the optimal objective value. For large datasets for 
which IPM cannot be applied, we stop the first-order methods when the relative change in 
the objective is below 10^^. In addition, we set the maximum iterations to be 20,000. 

In simulation data, we constrain the regularization parameters such that A = 7 and 
we assume that for each group g, Wg = 1. As for the smoothing parameter /i, ^u is set 
to 27J according to Theorem [TJ where D is determined by the problem size. It is natural 
that for large-scale problems with large D, a larger e can be adopted without affecting 
the recovery quality significantly. Therefore, instead of setting e, we directly set fi = 10~^, 
which provided us reasonably good approximation accuracies for different scales of problems 
based on our experience for a range of u in simulatio ns. As for FOBOS, we set the stepsize 
rate to -% as suggested in ( Duchi and Singeri . l2009l ). where c is set to -7= for univariate 



/NJ 



regression and ,^ for multivariate regression lf| 

5.1 Simulation Study: Overlapping Group Lasso 

We simulate data for a univariate linear regression model with the overlapping group struc- 
ture as described below. Assuming that the inputs are ordered, we define a sequence of 
groups of 100 adjacent inputs with an overlap of 10 variables between two successive groups 
so that g = {{1, . . . , 100}, {91, . . . , 190}, . . . , { J - 99, . . . , J}} with J = 90|^| + 10. We set 
Pj = (—1)-' exp(— (j — 1)/100) for 1 < j < J. We sample each element of X from i.i.d. Gaus- 
sian distribution, and generate the output data from y = X/3 -|- e, where e ~ N{0,Ij\[xn)- 



3. We use the state-of-the-art MATLAB package SDPT3 (|Tutuncu et all . 120031 ') for SOCP. 

4. We use the commercial package MOSEK (jMOSl) for QP. Graph-guided fused lasso can also be solved by 
SOCP but it is less efficient than QP. 

5. It is known that, empirically, a large c may lead to the failure of the convergence but a smaller value may 
make the optimization slow. According to our own experience with a range of values for c, such a way 
of setting c guarantees the convergence for all different scales of the problems studied in our experiment 
and provides us reasonably good empirical convergence speed. 
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Table 3: Comparisons of different optimization methods on the overlapping group lasso 



\Q 


= 10 


iV=l,000 


Ar=5,000 


iV=10,000 


(J = 910) 


CPU (s) 


Obj. 


CPU (s) 


Obj. 


CPU (s) 


Obj. 


7 = 2 


SOCP 


103.71 


266.68 


493.08 


917.13 


3777.46 


1765.52 


FOBOS 


27.12 


266.95 


1.71 


918.02 


1.48 


1765.61 


Prox-Grad 


0.87 


266.95 


0.71 


917.46 


1.28 


1765.69 


7 = 0.5 


SOCP 


106.02 


83.30 


510.56 


745.10 


3585.77 


1596.42 


FOBOS 


32.44 


82.99 


4.98 


745.79 


4.65 


1597.53 


Prox-Grad 


0.42 


83.39 


0.41 


745.10 


0.69 


1596.45 


1^1=50 
(J = 4510) 


Ar=l 


,000 


7V=5,000 


iV=10,000 


CPU (s) 


Obj. 


CPU (s) 


Obj. 


CPU (s) 


Obj. 


7 = 10 


SOCP 


4144.20 


1089.01 


- 


- 


- 


- 


FOBOS 


476.91 


1191.05 


394.75 


1533.31 


79.82 


2263.49 


Prox-Grad 


56.35 


1089.05 


77.61 


1533.32 


78.90 


2263.60 


7 = 2.5 


SOCP 


3746.43 


277.91 


- 


- 


- 


- 


FOBOS 


478.62 


286.33 


867.94 


559.25 


183.72 


1266.73 


Prox-Grad 


33.09 


277.94 


30.13 


504.34 


26.74 


1266.72 


1^1 = 100 
(J = 9010) 


Ar=l 


,000 


iV=5 


,000 


A^=10,000 


CPU (s) 


Obj. 


CPU (s) 


Obj. 


CPU (s) 


Obj. 


7 = 20 


FOBOS 


1336.72 


2090.81 


2261.36 


3132.13 


1091.20 


3278.20 


Prox-Grad 


234.71 


2090.79 


225.28 


2692.98 


368.52 


3278.22 


7 = 5 


FOBOS 


1689.69 


564.21 


2287.11 


1302.55 


3342.61 


1185.66 


Prox-Grad 


169.61 


541.61 


192.92 


736.56 


176.72 


1114.93 



To demonstrate the efficiency and scalability of Prox-Grad, we vary J, A^ and 7 and 
report the total CPU time in seconds and the objective value in Table [3l The regularization 
parameter 7 is set to either |^|/5 or |^|/20. As we can see from Table [3l firstly, both 
Prox-Grad and FOBOS are more efficient and scalable by orders of magnitude than IPM 
for SOCP. For larger J and N ^ we are unable to collect the results of SOCP. Secondly, 
Prox-Grad is more efficient than FOBOS for almost all different scales of the problemsO 
Thirdly, for Prox-Grad, a smaller 7 leads to faster convergence. This result is consistent 
with Theorem [7] which shows that the number of iterations is linear in 7 through the term 
||C||. Moreover, we notice that a larger N may not increase the computational time for 



In some entries in Table O the Obj. from FOBOS is much larger than other methods. This is because 
that FOBOS has reached the maximum number of iterations before convergence. 
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Figure 3: Regression coefficients estimated by different methods based on a single simulated 
dataset. b = 0.8 and threshold p — 0.3 for the output correlation graph are used. 
Red pixels indicate large values, (a) The correlation coefficient matrix of phenotypes, 
(b) the edges of the phenotype correlation graph obtained at threshold 0.3 are shown as 
black pixels, (c) the true regression coefficients used in simulation. Absolute values of the 
estimated regression coefficients are shown for (d) lasso, (e) £1/^2 regularized multivariate 
regression, (f) GFlasso. Rows correspond to outputs and columns to inputs. 
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Figure 4: Comparisons of Prox-Grad, FOBOS and QP. (a) Vary K from 50 to 10, 000, fixing N = 
500, J = 100; (b) Vary J from 50 to 10, 000, fixing N = 1000, K = 50; and (c) Vary N 
from 500 to 10000, fixing J = 100, K = 50. 



Prox-Grad. This is also consistent with the time complexity analysis, which shows that for 
linear regression, the per-iteration time complexity is independent of A^. 



5.2 Simulation Study: Multivariate Graph-guided Fused Lasso 

We simulate data using the following scenario analogous to the problem of genetic associ- 
ation mapping, where we are interested in identifying a small number of genetic variations 
(inputs) that influence the phenotypes (outputs). We use i^ = 10, J = 30 and N = 100. 
To simulate the input data, we use the genotypes of the 60 individuals fr om the parents 
of the HapMap CEU panel ( The International HapMap Consortium! . |2005| ). and generate 
genotypes for additional 40 individuals by randomly mating the original 60 individuals. 
We generate the regression coefficients /3yfc's such that the outputs y^'s are correlated with 
a block-like structure in the correlation matrix. We first choose input-output pairs with 
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non-zero regression coefficients as we describe below. We assume three groups of correlated 
output variables of sizes 3, 3, and 4. We randomly select inputs that are relevant jointly 
among the outputs within each group, and select additional inputs relevant across multiple 
groups to model the situation of a higher-level correlation structure across two subgraphs as 
in Figure Ella) . Given the sparsity pattern of B, we set all non-zero /3jj to a constant b = 0.8 
to construct the true coefficient matrix B. Then, we simulate output data based on the 
linear regression model with noise distributed as standard Gaussian, using the simulated 
genotypes as inputs. We threshold the output correlation matrix in Figure [3l^a) at p = 0.3 
to obtain the graph in Figure [3l^b), and use this graph as prior structural information for 
graph-guided fused lasso. As an illustrative example, the estimated regression coefficients 
from different methods are shown in Figures [3^d)-(f). While the results of lasso and ^1/^2- 
regularized multi-task regression in Figures [3|^d) and (e) contain many false positives, the 
results from graph-guided fused lasso in Figure [3l^f ) show fewer false positives and reveal 
clear block structures. Thus, graph-guided fused lasso outperforms the other methods. 

To compare Prox-Grad with FOBOS and IPM for QP, we vary K, J, N, and present the 
computation time in seconds in Figures |31^a)-(c), respectively. We select the regularization 
parameter 7 using the separated validation dataset, and report the CPU time for GFlasso 
with the selected 7. The input, output and true regression coefficient B are generated in 
the way similar as above. More precisely, we assume that each group of correlated output 
variables is of size 10. For each group of the outputs, We randomly select 10% of the input 
variables as relevant. In addition, we randomly select 5% of the input variables as relevant 
to every two consecutive groups of outputs and 1% of the input variables as relevant to every 
three consecutive groups. We set the p for each dataset so that the number of edges is 5 times 
the number of the nodes (i.e. \E\ = 5K). Figure |4] shows that Prox-Grad is substantially 
more efficient and can scale up to very high-dimensional and large-scale datasets. Moreover, 
we notice that the increase of A^ almost does not affect the computation time of Prox-Grad, 
which is consistent with the complexity analysis in Section [37 



5.3 Real Data: Pathvi^ay Analysis of Breast Cancer Data 

In this section, we apply the Prox-Grad with the ov er lapping-group-lasso penalty to a real- 
world dataset collected from breast cancer tumors ( van de Vijver et al.l . I2OO2I : Ijacob et al 



2OO9I ) and solve it by the smoothing proximal gradient method. The data are given as gene 



expression measurements for 8,141 genes in 296 breast-cancer tumors (78 metastatic and 
217 non- metastatic), and the task is to select a small amount of the most relevant genes 
that gives the best prediction performance. 

Thus, a powerful way of discovering genes involved in a tumor growth is to consider 
groups of interact i ng ge nes in each pathway rather than individual genes independently 
( Ma and Kosorokl . l20ld ). The overlapping-group-lasso penalty provides us with a natural 



way to incorporate these known pathway information into the biological analysis, where each 
group consists of the genes in each pathway. This approach can allow us to find pathway- 
level gene groups of significance that can distinguish the two tumor types. In our analysis of 
the breast cancer dat a, we cluster the genes usiii g the canonical pathways from the Molecular 



Signatures Database (jSubramanian et al.l . l2005l ) , and construct the overlapping-group-lasso 



penalty using the pathway-based clusters as groups. Many of the groups overlap because 

19 



Chen and Lin and Kim and Carbonell and Xing 




500 1000 1500 2000 

Number of Variables (Genes) 



(a) 



-Overlapping-group-lasso penalty 
- LI -norm penalty 




500 1000 1500 2000 

Number of Variables (Genes) 



(b) 



Figure 5: Results from the analysis of breast cancer dataset. (a) Balanced error rate for varying 
the number of selected genes, and (b) the number of pathways for varying the number of 
selected genes. 



genes can participate in multiple pathways. Overall, we obtain 637 pathways over 3,510 
genes, with each pathway containing 23.47 genes on average and each gene appearing in 
four pathways on average. Then, we set up the optimization problem of minimizing the 
logistic loss with the overlapping-group-lasso penalty to classify the tumor types based on 
the gene expression levels, and solve it with the Prox-Grad method. 

Since the number of positive and negative samples are imbalanced, we adopt the bal- 
anced error rate defined as the average error rate of the two classes |I| We split the data into 
the training and testing sets with the ratio of 2:1, and vary the A = 7 from large to small 
to obtain the full regularization path. 

In FigureEl we compare the results from fitting the logistic regression with the overlapping 
group-lasso penalty and the model with the ^i-norm penalty. Figure[5ja) shows the balanced 
error rates for the different numbers of selected genes along the regularization path. As we 
can see, the balanced error rate for the model with the overlapping-group-lasso penalty 
is lower than the one with the ^i-norm, especially when the number of selected genes is 
between 500 to 1000. The model with the overlapping-group-lasso penalty achieves the 
best error rate of 29.23% when 696 genes are selected, and these 696 genes belong to 125 
different pathways. In Figure [5]^b), for the different numbers of selected genes, we show 
the number of pathways to which the selected genes belong. From Figure \Mi>) , we see that 
when the group structure information is incorporated, fewer pathways are selected. This 
indicates that regression with the overlapping-group-lasso penalty selects the genes at the 
pathway level as a functionally coherent groups, leading to an easy interpretation for func- 
tional analysis. On the other hand, the genes selected via the ^i-norm penalty are scattered 
across many pathways as genes are considered independently for selection. The total com- 
putational time for computing the whole regularization path with 20 different values for the 
regularization parameters is 331 seconds for the overlapping group lasso. 

We perform a funct ional enrichrnent an alysis on the selected pathways, using the func- 
tional annotation tool ( Huang et al.l . l2009l ) . and verify that the selected pathways are sig- 



7. See |http : //www . modelselect . inf . ethz . ch/evaluat ion . php | for more details 
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nificant in their relevance to the breast-cancer tumor types. For example, in a highly sparse 
model obtained with the group-lasso penalty at the very left end of Figure EJ^b) , the se- 
lected gene markers belong to only seven pathways, and many of these pathways appear to 
be reasonable candidates for an involvement in breast cancer. For instance, all proteins in 
one of the selected pathways are involved in the activity of proteases whose function is to 
degrade unnecessary or damaged proteins through a chemical reaction that breaks peptide 
bonds. One of the most important malignant properties of cancer involves the uncontrolled 
growth of a group of cells, and protease inhibitors, which degrade misfolded proteins, have 
been extensively studied in the treatment of cancer. Another interesting pathway selected 
by overlapping group lasso is known for its involvement in nicotinate and nicotinamide 
metabol ism. This pathway has been confirmed as a marker for breast cancer in previous 
studies ( Ma and Kosorokl . l20ld ). In particular, the gene ENPPl (ectonucleotide pyrophos- 



phatase/phosphodiesterase l)in_this pathway has been found to be overly expressed in 



breast tumors (lAbate et al.l . l2005l ). Other selected pathways include the one related to 
ribosomes and another related to DNA polymerase, which are critical in the process of gen- 
erating proteins from DNA and relevant to the property of uncontrolled growth in cancer 
cells. 

We also examine the number of selected pathways that gives the lowest error rate in 
FigureO At the error rate of 29.23%, 125 pathways (696 genes) are selected. It is interesting 
to notice that among these 125 pathways, one is closely related to apotosis, which is the 
process of programmed cell death that occurs in multicellular organisms and is widely known 
to be involved in un-controlled tumor growth in cancer. Another pathway involves the genes 
BRCAl, BRCA2, and ATR, which have all been associated with cancer susceptibility. 

For comparison, we examine the genes selected with the £i-norm penalty that does not 
consider the pathway information. In this case, we do not find any meaningful functional 
enrichment signals that are relevant to breast cancer. For example, among the 582 pathways 
that involve 687 genes at 37.55% error rate, we find two large pathways with functional 
enrichments, namely response to organic substance (83 genes with p-value 3.3E-13) and the 
process of oxidation reduction (73 genes with p-value 1.7E-11). However, both are quite 
large groups and matched to relatively high-level biological processes that do not provide 
much insight on cancer-specific pathways. 

6. Conclusions 

In this paper, we considered an optimization problem for estimating a structured-sparsity 
pattern in regression coefficients with a general class of structured-sparsity-inducing penal- 
ties. Many of the structured-sparsity-inducing penalties including the overlapping-group- 
lasso penalties and graph-guided fusion penalty share a common set of difficulties in opti- 
mization such as non-separability and non-smoothness. We showed that the optimization 
problems with these penalties can be transformed into the common form, and proposed a 
general optimization approach called smoothing proximal gradient method that can be ap- 
plied to an optimization problem of this common form. Our results show that the proposed 
method can efficiently solve high-dimensional problems. 

As for the future work, it is known that reducing /i over iterations leads to better 
empirical results. However, in such a scenario, the convergence rate is harder to analyze. In 
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addition, since the method is only based on gradient, its onhne version with the stochastic 
gradient descent can be easily derived. However, proving the regret bound will require a 
more careful investigation. 

7. Appendix 

7.1 Proof of Theorem [2] 

We first introduce the concept of Fenchel Conjugate. 

Definition 9 The Fenchel conjugate of a function (^(a) is the function ^*{fi) defined as: 

^*{P)= sup {cx^P-^icx)). 

ct^dom{ip) 



Recall that d{a.) = 2ll'*lP with the dom(Q:) = Q. According to Definition [9l the 
conjugate of d{-) at — ^ is: 

C(3\ __ f ^tCP 



sup I a d{cx) 

1^ J otdQ V t^ 



and hence 



( C3\ 
f^{/3) = argmax [oFC(3 — iid{cx)) = ^d* I I 



According to Theorem 26.3 in ( Rockafellarl . Il996l ) "a closed proper convex function is es- 



sentially strictly convex if and only if its conjugate is essentially smooth", since d{oL) is 
a closely proper strictly convex function, its conjugate is smooth. Therefore, //i(/3) is a 

smooth function. 

Now we apply Danskin's Theorem (Prop B.25 in ( Bertsekaa . 1 19991 )) to derive Vffj.{(3). 



Let (j){a,P) = cx^ C (5 — iid{a) . Since d{-) is a strongly convex function, argmax^^gg (/)(q;,/3) 
has a unique optimal solution and we denote it as a* . According to Danskin's Theorem: 

V^(/3) = V^(/>(a*,/3) = C'^cx*. (24) 

As for the proof of Lipschitz constant of f^{P), readers may refer to ( Nesterovl . |2005| ). 
7.2 Proof of Proposition [3] 

a* = argmax (Q^C/3--||a||^) (25) 

argniax^ (^-/WgO^fBg - -||Qg||i 



"es ;^ ^ - - 2 



arg min > 



9 ||2 



"^2 geg ^' 
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Therefore, ()25p can be decomposed into \G\ independent problems: each one is the Euchdean 
projection onto the ^2-ball: 



■ II T^9/^3||2 

a = argmm \\ag ^||2, 

ag:\\ag\\2<l /^ 



and a* 



(a* V (a* )^ 



According to the property of ^2-t)an, it can be easily shown that: 



/^ 



||u||2 II 112 

II II ^1 

U ||u||2 < 1. 



As for ||C7||, 



V see j&g 



\ 



i=l \9&G s.t. jeg 



the maximum value of ||Cv||2, given ||v||2 < 1, can be achieved by setting m for j cor- 
responding to the largest summation J2neg st j&qi'^a)'^ *° ^^^-^ ^^"^ setting other Uj's to 
zeros. Hence, we have 



Cv||2 = 7 max \ jS^ ^ {w„) 

7e|i,...,J} V ^^9&5 s.t. jes^ ^^ 



7.3 Proof of Proposition [4] 

Similar to the proof technique of Proposition 1, we reformulate the problem of solving a* 
as a Euclidean projection: 

a* = argmaxfo; C(3 ||q;||2 

= argmm \\a 1|2, 

":||a||oo<l /^ 

and the optimal solution a* can be obtained by projecting — onto the £oo-ball- 
According to the construction of matrix C, we have for any vector v: 

l|Cv||2=7^ ^ {T{rmi))^{vm -sign{r mi)vif (26) 

e=(m,l)GE 

By the simple fact that (a it 6)^ < 2a^ + 25^ and the inequality holds as equality if and 
only if a = ±b, for each edge e = {m, I) G E, the value {vm — sign{rmi)vi)'^ is upper bounded 
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by 2v^ + 2vf. Hence, when ||v||2 = 1, the right-hand side of (j26p can be further bounded 
by: 



ICvl 



< 



7'Ee=(,n,0ei?2(r(r„0)'(< + ^r) 



— 7 Z/j6v(Se incident on k '^VVe)) )^j 

< 27^ maxjgy dj, 



(27) 



where 



Therefore, we have 



d, 



\C\\ 



eS-E s.i. e incident on j 



max llCvlh < , /27^ maxc?,-. 

I|vi|2<l V J'6^ 



Note that this upper bound is tight because the first inequahty in (|27p is tight. 

7.4 Proof of Theorem [7] 

This proof fohows the sin iilar scheme used in ( Lan et al.1 . 1201 ll ). Based on the result from 
( Beck and Tebouhd . 12003 ) . we have the fohowing lemma: 

Lemma 10 For the function /(/3) = h{(3) + A||/3||i; where h{(3) is an arbitrary convex 
smooth function and its gradient 'Vh{(3) is Lipschitz continuous with the Lipschitz constant 
L. We apply Algorithm[l\ to minimize f{(3) and let /3 be the approximate solution at the 
t-th iteration. For any (3, we have the following bound: 



f{f3') - fi(3) < 



2L\\(3-f3X 

t2 



(28) 



In order to use the bound in (I28p . we decompose f{(3) — /(/3*) into three terms: 

/(/3*) - /(/3*) = (/(/3*) - 7(/3*)) + (7(/3*) - 7(/3*)) + (/(/3*) - /(/3*)) . (29) 

According to the definition of /, we know that for any (3 

m<fi(3)<m+fiD, 

where D = uiayicteQ d{ct) . Therefore, the first term in ()29p . /(/3 ) — f{(3 ), is upper-bounded 
by ^D, and the last term in (p9]) is less than or equal to (i.e., /(/3*) — /(/3*) < 0). 
Combining (j28p with these two simple bounds, we have: 



'0||2 



f{f3^)-f((3*)<fsD + 



2L\\f3*-/3^\\l^^^^2\\f3*-(3'n, 



Amax(X X) + 



\\c\\' 



By setting /U = 2IJ ^^^ plugging this into the right-hand side of ([30]) . we obtain 

2D\\C\r 



i*\\2 



fi(3') - m) < I + ^^ ( A^ax (X^X) + 



(30) 



(31) 



If we require the right-hand side of (I3ip to be equal to e and solve it for t, we obtain the 
bound of t in (Ull). 
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